Fix launcher job scheduling directives when unsuspending#772

Open
GonzaloSaez wants to merge 1 commit into kubeflow:master from GonzaloSaez:fix_kueue_launcher_suspended

Conversation

@GonzaloSaez
Contributor

@GonzaloSaez GonzaloSaez commented Feb 15, 2026

This should address #770.

If an MPIJob is suspended and then unsuspended (as Kueue does during workload creation or when preemption occurs), the launcher Job does not have the correct scheduling directives after it is unsuspended. We need to perform the same operations as JobSet does: https://github.com/kubernetes-sigs/jobset/blob/f1bbaaef64b2a56c4721843b1d83750d21227948/pkg/controllers/jobset_controller.go#L537
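
For illustration only, the order of operations discussed in this PR (clear the start time, sync the scheduling directives, then unsuspend) can be sketched as below. The `Job` and `PodTemplate` types are simplified, hypothetical stand-ins rather than the real `batchv1.Job` and `corev1.PodTemplateSpec`, `resumeLauncher` and `merge` are made-up names, and replacing tolerations wholesale is an assumption of this sketch, not something the PR states.

```go
package main

// PodTemplate is a simplified, hypothetical stand-in for corev1.PodTemplateSpec.
type PodTemplate struct {
	Labels       map[string]string
	Annotations  map[string]string
	NodeSelector map[string]string
	Tolerations  []string
}

// Job is a simplified, hypothetical stand-in for batchv1.Job.
type Job struct {
	Suspend   bool
	StartTime *int64 // Unix seconds; stand-in for status.startTime
	Template  PodTemplate
}

// merge copies src into dst, allocating dst only when there is something to copy.
func merge(dst, src map[string]string) map[string]string {
	if len(src) > 0 && dst == nil {
		dst = make(map[string]string, len(src))
	}
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

// resumeLauncher applies the desired scheduling directives to a suspended
// launcher and then unsuspends it. The early return for a running launcher
// is an assumption of this sketch.
func resumeLauncher(launcher *Job, desired PodTemplate) {
	if !launcher.Suspend {
		return // already running; leave the pod template untouched
	}
	// Step 1: clear the start time left over from a previous run.
	launcher.StartTime = nil
	// Step 2: sync the scheduling directives onto the pod template.
	launcher.Template.Labels = merge(launcher.Template.Labels, desired.Labels)
	launcher.Template.Annotations = merge(launcher.Template.Annotations, desired.Annotations)
	launcher.Template.NodeSelector = merge(launcher.Template.NodeSelector, desired.NodeSelector)
	launcher.Template.Tolerations = append([]string(nil), desired.Tolerations...)
	// Step 3: unsuspend only after the template is up to date.
	launcher.Suspend = false
}
```

In the real controller each step is an API call (a status update followed by a spec update), so the in-memory mutation here only shows the ordering.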

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member

@GonzaloSaez could you sign DCO?

@tenzen-y
Member

Avoid creating the launcher job if the MPIJob starts suspended. It adds load to the apiserver for not much value.

@GonzaloSaez Could you keep the current mechanism (creating a batch/v1 Job even when the MPIJob is suspended)?
This change in semantics could be a breaking change, which cannot be released within the same major version.

@tenzen-y
Member

@GonzaloSaez could you sign DCO?

You can follow the steps at https://github.com/kubeflow/mpi-operator/pull/772/checks?check_run_id=63645778871 to sign the DCO.

Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
@GonzaloSaez GonzaloSaez force-pushed the fix_kueue_launcher_suspended branch from 880261d to fe8d324 Compare February 15, 2026 17:39
Member

@tenzen-y tenzen-y left a comment

@GonzaloSaez Thank you for working on this problem.
Basically, LGTM.

Additionally, could you add an integration test case to https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

// so we must clear it first via a status sub-resource update (consistent with JobSet).
if launcher.Status.StartTime != nil {
	launcher.Status.StartTime = nil
	if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
Member

Suggested change
- if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
+ var err error
+ if launcher, err = c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {

Could you update launcher after the startTime update to avoid a conflict during the scheduling directive update?
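
To illustrate why reusing the returned object matters, here is a minimal simulation of optimistic concurrency. It assumes the server rejects a write whose resource version is stale, which is how Kubernetes updates generally behave. The `store` type, the `Job` fields, and both `resume...` functions are hypothetical and only mimic the shape of the real client calls.

```go
package main

import "errors"

var errConflict = errors.New("conflict: object has been modified")

// Job is a simplified, hypothetical stand-in for a stored batch/v1 Job.
type Job struct {
	ResourceVersion int
	StartTime       *int64
	NodeSelector    map[string]string
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct{ current Job }

// update rejects writes based on a stale ResourceVersion. On success it
// returns the stored copy, which carries the new ResourceVersion.
func (s *store) update(j Job) (Job, error) {
	if j.ResourceVersion != s.current.ResourceVersion {
		return Job{}, errConflict
	}
	j.ResourceVersion++
	s.current = j
	return j, nil
}

// resumeDiscardingResult drops the object returned by the first update,
// so the second update is sent with a stale ResourceVersion.
func resumeDiscardingResult(s *store, launcher Job) error {
	launcher.StartTime = nil
	if _, err := s.update(launcher); err != nil {
		return err
	}
	launcher.NodeSelector = map[string]string{"pool": "a"}
	_, err := s.update(launcher)
	return err
}

// resumeReusingResult keeps the object returned by the first update,
// so the second update carries the fresh ResourceVersion.
func resumeReusingResult(s *store, launcher Job) error {
	var err error
	launcher.StartTime = nil
	if launcher, err = s.update(launcher); err != nil {
		return err
	}
	launcher.NodeSelector = map[string]string{"pool": "a"}
	_, err = s.update(launcher)
	return err
}
```

In this simulation `resumeDiscardingResult` fails with `errConflict` on its second write, while `resumeReusingResult` succeeds and leaves the store two versions ahead.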

// syncLauncherSchedulingDirectives updates the mutable scheduling directives (as per KEP-2926) on
// the launcher Job's pod template to match the desired template.
func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	if launcher.Spec.Template.Labels == nil {
Member

Suggested change
- if launcher.Spec.Template.Labels == nil {
+ if desired.Labels != nil && launcher.Spec.Template.Labels == nil {

It would be better to allocate the map only when desired.Labels is non-nil.

// the launcher Job's pod template to match the desired template.
func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	if launcher.Spec.Template.Labels == nil {
		launcher.Spec.Template.Labels = make(map[string]string)
Member

Suggested change
- launcher.Spec.Template.Labels = make(map[string]string)
+ launcher.Spec.Template.Labels = make(map[string]string, len(desired.Labels))

Comment on lines +1655 to +1662
if desired.Annotations != nil {
	if launcher.Spec.Template.Annotations == nil {
		launcher.Spec.Template.Annotations = make(map[string]string)
	}
	for k, v := range desired.Annotations {
		launcher.Spec.Template.Annotations[k] = v
	}
}
Member

Suggested change
- if desired.Annotations != nil {
- 	if launcher.Spec.Template.Annotations == nil {
- 		launcher.Spec.Template.Annotations = make(map[string]string)
- 	}
- 	for k, v := range desired.Annotations {
- 		launcher.Spec.Template.Annotations[k] = v
- 	}
- }
+ if desired.Annotations != nil && launcher.Spec.Template.Annotations == nil {
+ 	launcher.Spec.Template.Annotations = make(map[string]string)
+ }
+ for k, v := range desired.Annotations {
+ 	launcher.Spec.Template.Annotations[k] = v
+ }

Ranging over a nil map is a no-op, so the loop body runs only when desired.Annotations is non-nil.
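
As a small standalone sketch of the pattern in the suggestions above: allocate the destination lazily with a size hint, and rely on `range` over a nil map doing nothing. The helper name `mergeStringMap` is hypothetical and is not part of the PR.

```go
package main

// mergeStringMap copies every entry of src into dst and returns dst.
// dst is allocated only when src is non-nil, and it is sized up front
// using len(src) as a capacity hint.
func mergeStringMap(dst, src map[string]string) map[string]string {
	if src != nil && dst == nil {
		dst = make(map[string]string, len(src))
	}
	// Ranging over a nil map performs zero iterations, so no extra
	// nil check around the loop is needed.
	for k, v := range src {
		dst[k] = v
	}
	return dst
}
```

With this shape, a nil `src` leaves `dst` untouched (including leaving it nil), and existing keys in `dst` are overwritten only when `src` defines them.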
